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Abstract: In this paper, we annotate and align two different gene expression microarray designs using the Genomic ELement 
Ontology (GELO). GELO is a new ontology that leverages an existing community resource, Sequence Ontology (SO), to 
create views of genomically-aligned data in a semantic web environment. We start the process by mapping array probes to 
genomic coordinates. The coordinates represent an implicit link between the probes and multiple genomic elements, such 
as genes, transcripts, miRNA, and repetitive elements, which are represented using concepts in SO. We then use the RDF 
Query Language (SPARQL) to create explicit links between the probes and the elements. We show how the approach allows 
us to easily determine the element coverage and genomic overlap of the two array designs. We believe that the method will 
ultimately be useful for integration of cancer data across multiple omic studies. The ontology and other materials described 
in this paper are available at http://krauthammerlab.med.yale.edu/wiki/Gelo. 
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Introduction and Background 

The sequencing of the human genome ^'^ and subsequent annotation initiatives^'"* are creating a large 
body of information on genome accessibiUty (methylation and histone modifications), transcription 
(mRNA, ncRNA expression), and structural variations (such as inversion, duplication and translocation).^"^ 
The task of organizing such large volumes of data becomes increasingly complex, as do the subsequent 
analyses of the information. ^"^^ In an effort to catalog this and similar information, well over 1,000 
different databases are currently actively maintained in the realm of molecular biology.''^ The problem 
is that many of them are neither connected nor integrated.*^"'"'^^ 

The area of data integration using semantic web technologies remains under active development. '^"^^ 
Compared to more traditional relational database systems, the use of semantic web technologies 
simplifies data integration through W3C-supported knowledge representation standards such as Resource 
Description Framework Schema,'^ and Web Ontology Language.'^ A growing list of standardized 
vocabularies and data sources in RDFS and OWL, such as Gene Ontology (GO),^" Sequence Ontology 
(SO),^^ and other projects within the realm of Open Biological Ontologies (OBO),^^ allow the scientific 
community to move away from a plethora of home-built data models towards a situation where numerous 
data and knowledge bases share the same or related upper level schemas. This standardization of data 
models is desirable to facilitate the sharing of cancer data across multiple genomic data stores. Also, 
in the area of human genomics, where new facts and types of facts are discovered on a regular basis, 
a traditional relational model of storing data becomes less than optimal. For example, to include a new 
type of fact, a rigidly defined relational-database table would need to be updated with additional 
columns to accommodate the new information. In contrast, triple stores can easily add new properties 
to existing information by means of subject-predicate-object triples. Finally, an additional benefit of 
using semantic web technologies, albeit currently underutilized, involves the possibility of implementing 
reasoners that can logically infer relationships among the entities in the store. Triple stores allow for 
queries that are not easily performed in traditional databases, such as queries across hierarchies, as in 
ontologies. Reasoner software can also help in performing consistency checks over complex knowledge 
bases using logical rules. ^ 

In this study we discuss the use of semantic web technology for array annotation and alignment. 
Most of our data is derived from cancer microarray experiments. A critical step in the microarray data 
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integration process is the alignment of the different 
microarray designs to perform integrated analyses. 
Mapping of array probes to genomic coordinates 
is essential for this task. The coordinates represent 
an implicit link between the probes and multiple 
genomic elements such as genes, transcripts, 
miRNA, and repetitive elements which are anno- 
tated using concepts from Sequence Ontology 
(SO).^^ By creating explicit links between the loci 
of genomic elements, we are able to derive 
which probes and elements align. The mapping of 
probes to elements achieves two goals: first, it links 
probes to gene transcripts (elements in SO), 
allowing for the re-annotation of the array design 
with the transcripts covered. Second, we can estab- 
lish the overlap between the probes of two differ- 
ent array designs, establishing the degree of 
alignment. 

Materials and Methods 

GELO 

Our project provides a unique approach to linking 
various data in the area of molecular biology 
using semantic web technologies. Unlike other 
approaches, such as Bio2RDF''^ that rely on 
database identifiers, names, and synonyms to link 
information, we use the genomic coordinates as 
a biologically-meaningful scaffold to attach and 
align information. Creating synonyms to link 
disparate sources of data is a useful approach but 
it requires time-consuming manual curation. Our 
approach allows the system to automatically infer 
that any two elements are the same if they map 
to the same coordinates in a particular genome 
build. Consequently, any annotation pertinent to 
one element can be applied to the other as well. 
Additionally, each genomic element can be 
automatically represented in the context of other 
elements by means of relationships such as 
"fl upstream_of 6", "c on _the _same _strand _ 
as or contains /". This allows for complex 
queries such as for exons that are contained in a 
particular transcript. 

Our Genomic ELement Ontology (GELO) 
(Table 1 and Fig. 1) is loosely based on an Open 
Biological Ontology^^ project called Sequence 
Ontology (SO),^^ a standard for annotating regions 
of the human genome. "Region ", an (incomplete) 
sub-branch of SO, is used as the basis of the GELO 
ontology. Our '•'•GenomicElement\ a superclass 
of SO's '''region", subsumes all terms predefined 



in SO (e.g. '''repetitive element", "ncRNA", 
etc.). The class "GenomicElemenf is flexible 
enough, however, so that it could be used to 
conceptually represent any of the following: "the 
entire genome" , "a single chromosome", "a band 
on a chromosome" , "an n-megabase-long 
region", "a specific gene", "an exon", or 
"a unique 50-mer within the ejcon". Anovel class 
"GenomicLocus" was created to provide a facility 
to link any "GenomicElement" to its sequence 
in a particular assembly of the human genome 
(Table 1). The relationships "locus _of' and 
"hasjocus" were defined to link "GenomicElement " 
with its biological coordinates stored in a 
"GenomicLocus" . To describe the relative 
position of two instances "GenomicLocus" in 
the genome, "'contains''' and contained Jby 
(subclasses of ""has jjart" and "part_of\ already 
defined in SO) were defined in GELO. Several 
other properties will be defined to facilitate the 
relative positioning of the regions: transitive 
"upstreamjDf and its inverse: " downstream _of\ 
symmetric "on_the_same_strand_as" with 
an analogous "on_the_opposite_strand J'roni", 
symmetric "overlaps _with'\ and so on. The 
proof of concept described in this manuscript 
relies only on the relationships "contains'' and 
" contained _by\ As our repository grows, other 
relationships not discussed in this manuscript will 
be added as well. 



Knowledge base 

We describe the process used to construct our 
knowledge base using GELO and a set of genomic 
sequences. Two sets of sequences were used, both 
being lists of probes from commercial Nimblegen 
microarrays. The two sets indicate an evolution 
of the microarray design as the first one was gen- 
erated in 2005 (2005-04-20_Human_60mer_lin2 
array) and the second in 2006 (2006-08-03_ 
HG18_60mer_expr). Both sets of probes are 
available within Nimblegen design files. The first 
step was to take the sequences of the probes 
and strip them of any existing annotation. The next 
step involved mapping the sequences to the 
most current build of the human genome (hgl 8,)^^ 
using the BLAST-Like Alignment Tool (BLAT).^"^ 
We set BEAT to find probe alignments with 50% 
and better similarity scores. A resulting PSL 
file was then parsed using a custom python 
script, which converted the tabular format into 
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Table 1. GELO, ENTREZ and PROBE helper ontologies. The three ontologies link to classes in sequence 
Ontology. The helper ontologies define 3 new classes, highlighted in bold. 



Ontology 


Domain 


Relationship 


Range/Type 


GELO 


GELO:GenomicElement 


rdf:type 


owhclass 




GELO:GenomicElement 


GELO:has locus 


GELO:GenomicLocus 




GELO:GenomicElement 


SO:has_part 


GELOiGenomicElement 




GELO:GenomicLocus 


rdf:type 


owl:class 




GELOiGenomicLocus 


GELO:start 


integer 




GELOiGenomicLocus 


GELO:end 


integer 




GELOiGenomicLocus 


GELO:strand 


+/-/* 




GELOiGenomicLocus 


GELO:on_chromosome 


SO:Chromosome 




GELO:GenomicLocus 


GELO:in_assembly 


SO:Assembly 




GELO:GenomicLocus 


GELO:locus_of 


GELOiGenomicElement 


ENTREZ 


SO:Gene 


ENTREZ:has_transcript 


SO:Transcript 




SO:Gene 


ENTREZ:qene symbol 


string 




SO:Gene 


ENTREZ:synonym 


string 




SO:Gene 


ENTREZ:entrez_gene_id 


string 




SO:Gene 


ENTREZ:description 


string 




SO:Gene 


ENTREZ:taxonomy_id 


string 




SO:transcript 


ENTREZ:transcript_OT 


SO:Gene 




SO:transcript 


ENTREZ: RNA_ 
nucleotide_accession 


string 


PROBE 


PROBE:ProbeSet 


rdf:type 


owl: class 




PROBE:ProbeSet 


SO:has part 


SO:Probe 


SO 


SO:Region 


rdf:subclass of 


GELO:GenomicElement 




SO:Gene 


rdf:subclass_of 


SO:Region 




SO:Transcript 


rdf:subclass_of 


SO:Region 




SO:lntron 


rdf:subclass_of 


SO:Region 




SO:Exon 


rdf:subclass_of 


SO:Region 




SO:miRNA 


rdf:subclass_of 


SO:Region 




SO: Probe 


rdf:subclass_of 


SO:Region 



Subject-Predicate-Object triples, stored in the 
N-Triples format. The N-Triples format was 
chosen because of its simplicity in comparison to 
other formats such as RDF/XML,^^ Turtle" or N3.^^ 
Within the N-Triples representation, all sequences 
have been represented as individuals of RDF: type 
''probe'' (defined in SO as ''SO_000005r) and aU 
BLAT matches were represented as of RDF: type 
''GenomicLocus'' annotated with genomic 
coordinates found by BLAT. All ''GenomicLocus" 
individuals were linked to their appropriate 
"probe" individuals using '' locus _of relationship 
from GELO. To ease the retrieval of probe 



locations from one design file over others, the 
concept ''ProbeSef' was defined in a separate 
(helper) ontology called Probe (Table 1). Two 
instances of ""ProbeSef were created, one 
aggregating probe sequences from the 2005 design 
file, the other aggregating probe sequences from 
the 2006 design file. The ''part_of' relationship 
was used to link particular probes with their cor- 
responding probe design. 

At the time of publication our knowledge base 
contained information about genes, their transcripts, 
and the locations of introns and exons. The import 
of gene information was performed as follows. 
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Probe 


^ ^ SO 

^ 4iii!^I^fS^^S>^:^^j5> 

(rSO_000141l]^"'^ gWW - - " 


qtJJ-a 


-^•nornkflifnti^^-'^ 

(^rwmlcLocuO 

GELQ 


I^SOMOMWii-, is-a ^ :. ^ ^ , ^ --^ 

^--JL^--^ Qo_00006K^CI-ii^^ — Qo_0000696]^<l-li^i — Q^^WPOSO 



Figure 1. A graph representing tine classes of Sequence Ontology (SO), GELO and PROBE. Only a subset of SO classes are shown. Labels 
correspond to SO's label property for nodes. 



A list of FASTA files containing the sequences of 
all human transcripts was acquired from the Refseq 
database at NCBI (RefSeq release 33).^^ The 
information linking the genes with gene names, 
gene symbols, synonyms, etc, was acquired from 
two files: homo sapiens.gene info and gene2refseq 
(both available via FTP from NCBI's gene data- 
base). ^° The sequences were aligned with the latest 
build of the human genome (hgl 8)^^ using BLAT.^"^ 
Subsequently the tabular output files of BLAT 
together with transcript and gene annotation from 
NCBI were converted to N-Triples representation 
using a custom python script. 

Within the N-Triples representation all known 
genes were defined as rdf: type ^'80 0000704 " and 
all known transcripts using rdf: type "50 0000673" 
(both defined in SO). All BLAT matches were 
represented using rdf: type ''GenomicLocus" 
defined in GELO. Finally all gene individuals 
were linked to their respective transcripts using 
"hasjranscripf relationship (using an ENTREZ 
helper ontology created by our group to augment 
GELO with gene-specific relationships, see 
Table 1), and all transcripts were linked to their 
appropriate ''GenomicLocus" (i.e. BLAT mappings) 
using the ''has_locus'' relationship defined in 
GELO. Thus, an indirect link from genes to their 
respective locations was achieved. As every line 
of the psl file contains information about "block 
start" and "block end," indicative of the intron-exon 
structure of the transcript, this information was also 
included in the N-Triples file, whereas the introns 
were created as SO-defined ''SO_0000188" 
instances and exons as instances of ''SO_000014T\ 
Each intron and exon were linked to their respec- 
tive BLAT-determined ''GenomicLocus" instances 
via the ''hasjocus" relationship. 

We decided to use BigOWLIM as our storage 
system based on published and unpublished LUBM 
Benchmarks.^' Additionally, BigOWLIM uses the 



Sesame API which was successfully used in other 
semantic web projects. ^^^^"^ 

After loading all elements (probes, genes, 
introns, exons) and their respective loci on the 
chromosomes, we needed to determine which 
elements' loci overlap along the chromosome. We 
defined an "a contains b" semantic relationship as 
a relationship between any two individuals of class 
''GenomicLocus" such that an entire genomic 
sequence of an individual b can be found within 
the sequence of an individual a. As we were mainly 
interested in exploring the short 60-mer probe 
sequences in the context of their belonging to 
relatively long transcript sequences, we focused 
on the ''contains'' relationship only. Other 
relationships, such as ""overlaps'", although equally 
important, were assigned a lower priority and will 
be added to the repository in the future. 

To potentially link two loci using the ''contains" 
relationship, the knowledge base was queried using 
a SPARQL^^ expression (Fig. 2) to construct a new 
graph linking pairs of loci. The rule engine of 
OWLIM allows to create logic rules equivalent to 
the SPARQL expression listed in Figure 2. 
Currently, OWLIM rules do not support "bigger 
than" or "smaller than" constraints, but future ver- 
sions will do so (Personal Communication). The 
idea is that the rule engine will infer the "contains" 
and other relationships of GELO automatically 
upon insertion of new "GenomicLocus" data into 
the knowledge base. At this time, we resorted to 
constructing sub-graphs using SPARQL queries. 

A simple validation of the repository was done 
by probing the genomic vicinity of the NFkB 1 gene. 
A query was issued to retrieve all probes and their 
locations for all known transcripts of the NFkBI 
gene. Figure 3 shows the region with all probes 
retrieved plotted using the UCSC genome browser.^'' 
A comparison between 1) a list of probes that was 
originally associated with the gene during the array 
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PREFIX GELO: < http://krauthammerlab.med. vale.edu/ontologies/gelo# > 
PREFIX rdf: < http://www.w3.Org/1999/02/22-rdf-syntax-ns#> 

CONSTRUCT 
{ 

?x GELO: contains ?y. 

?y GELO: contained_by ?x. 

} 

WHERE 
{ 

?x rdf: type GELO: GenomicLocus. 
?y rdf: type GELO: GenomicLocus. 
?x GELO: on_chromosome ?chr. 
?y GELO: on_chromosome ?chr. 

?x GELO: locus_start ?x_start. 
?x GELO: locus_end ?x_end. 

?y GELO: locus_start ?y_start. 
?y GELO: locus_end ?y_end. 

FILTER ((?x != ?y) && (?x_start <= ?y_start) && (?x_end >= ?y_end)) 

} 

Figure 2. SPARQL query for finding GenomicLocus instances x and y, such that x contains GenomicLocus y. 



design process, and 2) the list of probes retrieved 
from our repository showed the following: first, our 
repository correctly mapped all NFkBI probes from 
the 2005 and 2006 designs to the NFkBI locus. 
Second, our list contained two extra probes over 



the original design. Further examination of the extra 
probes revealed them to be probes designed for 
other genes that imperfectly matched the NFkBI 
locus. Adjusting the threshold of "blat_match_ 
score'' removed the imperfect BLAT matches. 



chr1(q24) 



HSAP0DPD00D1077Za | I ' 
HSAPO0POOOO1 07730 [»^ 
HSAP00P0D00107716| 



IVl5Be03PD232S — 



BC051765P02399 



M58603P0Z5Z3 | 
BC051765P02596| 



Nimblegen 2005-04-ZO_Human_60mer_1inZ Array Design 
HSAPOOP0000107731 | 
HSAPOOP0000107717| 

HSAPOOP000010773Z| I ! | 

HSAP00P0D00107718 | 

HSAP00P0000107733| 
HSAP00P0000107719 



Mill 



Nimblegen 2Q06-08-Q3_HGl 8_5Dmer_expr Array Design 
Nivl_003998P0Z833 
1V158603POZ760 
BC0517B5P0Z833 

NM_003998P0Z969 
lvi58603P0ZSg6 | 

NM_D0399BP03D7B 
IVI5B603P03005 
BC0517B5P03078 



RefSeq Genes 



HSAP0DP00D0107734| 
HSAPO0POODO1O77Z0 j 
iHSAPDOP00001D7735| 
HSAPD0P0DD01D77Z1 | 
HSAPOOP00001077ZZ| 
HSAPOOP00001 07737 | 
HSAP00P0D001077Z3 f 
HSAPOOP0000107738 | 
HSAPDDPD00D1077Z4| 
HSAPOOP0000107739 | 
HSAPOOP00001077Z6 j 

NiVi_0D3998P03336 | 
ivi5B603P03Z63 j 
BC051765P03336 j 
Nlvl_003998P034B7 | 
iv15B603P03394 j 
BC0517G5P03467 j 
NM_00399BP03555 | 
j I I I IVI5BB03P0348Z | 
1 M I BCD517B5P03555 j 
NM_00399BP03593 | 
N(ViJ0399BP03683 | 
BC051765P03683 j 



))>>)>>>>>)> 



Figure 3. Probes from Nimblegen 2005-04-20_Human_60mer_1in2 and 2006-08-03_HG18_60mer_expr array designs that map to the 
NFkBI locus. Duplicate probes with the same position share the same sequence, but were assigned to separate nucleotide sequence 
accessions in the original design files. 
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The ontologies described in this paper can be 
accessed at http://krauthammerlab.med.yale.edu/ 
wiki/Gelo. 



Results and Discussion 

The goal of our knowledge base is the alignment 
of genomic data from cancer high-throughput 
experiments. We currently work with melanoma 
gene expression data from two different array 
designs, and we are interested in aligning the 
results of both designs. Having constructed our 
knowledge base of genes and their genomic 
locations, we attempted to re-annotate the 
sequences of the two microarray designs, the 2005 
design (2005-04-20_Human_60mer_lin2) with 
383,468 probe sequences, and the 2006 design 
(2006-08-03_HG18_60mer_expr) with 381,002 
probe sequences, and to determine the genomic 
overlap. 

Figure 4 shows the SPARQL query used to align 
the two design files and determine how many 
individual transcripts and genes are probed in each 
of the designs. 

The Venn diagram in Figure 5 illustrates the 
query result. Not surprisingly, the 2006 design 
features 1947 new genes that were not included in 
the previous year's design (to produce the graph 
from the resuhs of the SPARQL query, we used 
Python and R). A further examination of the 
genomic overlap among the two design files 
revealed 365 genes that were not included in the 
newer design. The differences among the design 



files could reflect the changes in the assembly of 
the human genome sequence as well as changes in the 
annotation of the sequences provided by Refseq.^^ 
The alignment of the two design files will enable 
user to determine which of the gene-specific probe 
sets can be compared between the two different 
designs. 

Next, we investigated the number of probes that 
are contained within each of the genes. An 
overwhelming majority of the genes had 10 probe 
sequences assigned to them in the 2005 design and 
8 for the 2006 design. This agreed with the prior 
knowledge about these microarray designs. The 
histograms in Figures 6A and 6B revealed a 
periodicity in probe count distribution. For 
example, there are several genes for which 20, 
30, etc., probes were selected in the 2006 design. 
This is a reflection of the number of transcripts 
covered per gene. The question is whether some 
of these probes are duplicates. To address this 
question, we investigated one of the genes, NFkB 1 , 
which had a probe count corresponding to two 
transcripts. The illustration in Figure 3 shows the 
NFkBI locus and the different array probes. It 
is evident that quite a few of the probes over- 
lap: -7734 and -7720, -7735 and -7721, and so 
on. A further examination of the repository revealed 
that NFkBI is linked to two (2005 design) and 
three (2006 design) uniquely identified, although 
completely identical, transcripts. The probe 
sequences were selected for each of the transcripts 
independently, possibly without acknowledging 
that they were, in fact, the same. As a result, in the 



PREFIX GELO: < http://krauthammerlab.med.vale.edU/ontoloqies/qelo# > 
PREFIX SO: < http://purl.orq/obo/owl/SO# > 

PREFIX ENTREZ:< http://krauthammerlab.med.vale.edu/ontoloqies/entrez# > 

PREFIX REL: < http://purl.orq/obo/owl/obo# > 

PREFIX NIMBLEGEN: < http://nimbleqen.eom/microarrav# > 

SELECT distinct ?gene_symbol 

WHERE 

{ 

?loc_b GELO: contains ?loc_a. 
?loc_a GELO: locus_of ?a. 
?a rdf: type SO: SO_0000051 . 

?a REL: part_of NIMBLEGEN: 2006-08-03_HG18_60mer_expr. 

?loc_b GELO: locus_of ?b. 

?b ENTREZ: transcript_of ?gene. 

?gene ENTREZ: gene_symbol ?gene_symbol. 

} 

Figure 4. A sample SPARQL query displaying a list of genes with transcripts containing at least one probe. 
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A) 



2005 2006 




39020 

Figure 5. Differences among the two microarray design files. A) Nui 
with a transcript that contains at least one probe. 



2005 design, we observe duplicates of certain 
probes, and in the 2006 design, triplicates of 
probes. Querying our knowledge base revealed that 
several different locations on the microarray sur- 
face store the same probe sequence. A researcher 
can use the information provided by the knowledge 
base and compare the microarray surface locations 
storing the same probe sequence to detect 
variability in the microarray data. 

The alignment of the two design files can now 
be used to revise and supplement the incomplete 
annotation of the original design files. Specifically, 
a closer look at the original annotation included in 
the 2006 design revealed that probes were designed 
for 41,621 unique transcripts identified by either 
Refseq ids (20,590 transcript) or GenBank 
accession ids (21,031 transcripts). Unlike the 
GenBank accession ids, the Refseq ids correspond 
to well-curated consensus messenger RNA 
sequences. A query to our knowledge base showed 
that probes which were originally mapped to the 
20,590 Refseq sequences are re-mapped to 24,079 
Refseq ids. The 21,031 transcripts with GenBank 
accessions ids are re-mapped to 17,332 refseq 
transcripts. Overall, the 41,621 transcript ids were 
re-mapped to 24,644 unique Refseq ids (versus 
20,590 in the original design) based on probe 
sequence alignment to the human genome. 

Figures 6C and 6D show two histograms 
depicting the uniqueness of probes with respect to 
genes. As expected, the majority of the probes in 
both 2005 and 2006 designs are unique, i.e. they 
report on expression level of just one sequence. 



B) 



2005 2006 




24742 

of transcripts that contain at least one probe. B) Number of genes 



However, the skew of the distributions suggests 
the presence of many "noisy" probes whose 
sequences match more than one, and sometimes 
even more than two or three genes. The re-annotation 
of the 2005 and 2006 design files can report on 
how "promiscuous" any given probe is, which is 
useful for signal normalization and de-noising of 
the microarray data. 

Another aspect of our knowledge base is the 
inclusion of information describing the polarity of 
probe and gene sequences. Figures 6E and 6F show 
the presence of probes in gene regions where the 
probes are on the opposite strand from the gene. 
The polarity of probe sequences with respect to 
gene sequences may be relevant for some experi- 
mental designs. Alternatively, in an experimental 
design where the relative position of the probe 
should not matter, the repository can be queried to 
find additional probes that, although anti-sense with 
respect to the gene, can be examined to further 
strengthen the evidence coming from the other, 
correct-sense probes. 

We would also like to discuss the performance 
of the knowledge base, which currently stores over 
39,000,000 exphcit statements (triples). The query 
in Figure 4 took 82,863 ms (1 .45 min) to complete 
and returned 1,253,878 statements mapping probes 
and transcripts for the 2005 design. The same query 
for the 2006 design completed within 102,000 ms 
(1.7 min) with 1,714,030 statements returned. 
A more complex query to return all probes mapped 
to the NFkBI gene (Fig. 3) took 411,692 ms 
(6.86 min) to return 41 statements. Semantic web 
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Figure 6. Relationship between genes and probes. A) 2005 design, number of probes contained_by each gene where both gene and probe 
are located on the same strand of DNA. The highlight in green indicates a strategy to design 10-11 probes per gene. B) 2006 design, number 
of probes contained_by eacU gene where both gene and probe are located on the same strand of DNA. The highlight in green indicates a 
strategy to design 8 probes per gene. C) 2005 design, number of genes probed by each probe, both gene and probe are located on the 
same strand of DNA. D) 2006 design, number of genes probed by each probe, both gene and probe are located on the same strand of DNA. 
E) analogous to A, the probe and the gene are located on the opposite strands of DNA, F) analogous to B, the probe and the gene are 
located on the opposite strand of DNA. 



technologies are evolving, and the time it takes 
to complete the queries will surely decrease in the 
future. For the purpose of our research, however, 
the response time was satisfactory. 

Overall, our knowledge base provides a 
biologically meaningful framework for the 
examination of genomic data. The potential of the 
semantic web to link virtually any piece of infor- 
mation in the context of its genomic location 
provides an attractive strategy for data integration 
and analysis in the 21st century. 
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